Based only on the information gathered in a snapshot of a directed network, we present a formal
way of checking if the proposed model is correct for the empirical growing network under study.
In particular, we show how to estimate the attractiveness, and present an application of the model
presented in [1] to the scientific publications network from the ISI dataset.
PACS numbers: 05.65.+b, 89.75.Kd, 87.23.Ge, 02.50.r, 02.50.O



The study of networks has attracted many scientists during the last decade. The most studied
directed growing networks are the WWW network [B| , where each node represents a web page and
the hyper-links (references to other web pages) represent the directed edges or links, and the
scientific papers network, where each paper is a node and its references are the directed links.
Many of the theoretical approaches try to determine whether a particular model provides a "good
description" of the dynamics of the empirical growing network under study, paying special
attention to the stationary state. Empirical growing networks show that this state is
characterized by a degree distribution with a power-law tail, where the degree of a vertex is
defined as the total number of its connections, and by a large clustering coefficient, which
measures how strongly connected the neighbors of a randomly selected node are. These two
descriptive statistics characterize the network topology well.

Typically, the way of checking whether a model mimics the real growing network (once the
clustering coefficient is near the empirical one) is to compare the limit degree distribution
with the empirical one, paying special attention to the tail of the distribution. In [1] it was
shown that this measure is not a good one, since many different models can give the same tail
(typically scale invariant). Two other informal checks were therefore suggested in [1] in order
to get a first idea of whether the model works well. The first one is based on the relation
between the variance of the out-degree random variable, $D_{out}$, and the in-out degree
covariance, $Cov(D_{in}, D_{out}) = E(D_{in} D_{out}) - E(D_{in})E(D_{out})$, where $D_{in}$ is
the in-degree of a randomly selected node. The second informal check is based on the relation
between the tail of the in-degree distribution and the tail of the out-degree one. In [1] these
two relations were studied in detail for the growing model shown in Fig. 1.
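As an illustration of the first informal check, the two quantities it compares can be computed directly from a snapshot. The sketch below assumes the snapshot is available as a list of directed `(source, target)` pairs plus a list of node labels; this input format and the function name `degree_moments` are assumptions made for the example and do not come from [1].

```python
from collections import Counter


def degree_moments(edges, nodes):
    """Return (Var(D_out), Cov(D_in, D_out)) for a uniformly chosen node.

    edges: iterable of (source, target) pairs, one per directed link.
    nodes: iterable with every node label, including isolated nodes.
    """
    edges = list(edges)
    nodes = list(nodes)
    out_deg = Counter(src for src, _ in edges)  # links leaving each node
    in_deg = Counter(dst for _, dst in edges)   # links arriving at each node
    n = len(nodes)

    mean_out = sum(out_deg[v] for v in nodes) / n
    mean_in = sum(in_deg[v] for v in nodes) / n
    mean_out_sq = sum(out_deg[v] ** 2 for v in nodes) / n
    mean_prod = sum(in_deg[v] * out_deg[v] for v in nodes) / n

    var_out = mean_out_sq - mean_out ** 2
    cov_in_out = mean_prod - mean_in * mean_out
    return var_out, cov_in_out


# Tiny hypothetical snapshot: node 1 is the oldest, node 4 the newest.
example_edges = [(2, 1), (3, 1), (3, 2), (4, 1), (4, 3)]
var_out, cov_in_out = degree_moments(example_edges, [1, 2, 3, 4])
```

Both moments are taken over all nodes with equal weight, which matches the "randomly selected node" convention used for $D_{in}$ and $D_{out}$.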

In this letter, we try to go beyond these informal checks by proposing a way to test whether a
particular model describes the empirical directed network well. In addition, we test the model
presented in [1] to see whether it describes the scientific citation network.



FIG. 1: Scheme of the growing network model. As time evolves, new nodes with $D_{out}$ out-links
appear and attach to the existing nodes. In this example $D_{out} = \{1,2\}$.

Let us now describe the growing network process (Fig. 1) presented in [1]: as time evolves, new
nodes with $D_{out}$ out (directed) links appear, which connect to the existing nodes according
to some probability law (uniform, preferential linking, etc.). $D_{out}$ is a random variable
(the number of out-links of a randomly selected node) with an arbitrary distribution,
$P(D_{out} = j) = p_j$ with $j \in \mathbb{N}$. For this model, the
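limit (in/out) degree distributions were computed. Under preferential linking with
attractiveness they get:

$$
\begin{aligned}
\nu^{out}_k &= p_k, && (a)\\
\nu^{in}_k &= \sum_{j=1}^{\infty} \frac{\psi(j+k+A,\,3+\delta)}{\psi(j+A,\,2+\delta)}\; p_j, && (b)\\
\nu^{n-k,k}_{in,out} &= \nu^{n,k}_{deg,out} = \frac{\psi(n+A,\,3+\delta)}{\psi(k+A,\,2+\delta)}\; p_k, && (c)\\
\nu^{deg}_n &= \sum_{k=1}^{n} \frac{\psi(n+A,\,3+\delta)}{\psi(k+A,\,2+\delta)}\; p_k, && (d)
\end{aligned}
\tag{1}
$$

where $\psi(a,b) = \Gamma[a]\Gamma[b]/\Gamma[a+b] = \int_0^1 t^{a-1}(1-t)^{b-1}\,dt$,
$E_o = \sum_{k=1}^{\infty} k p_k$, and $\delta = A/E_o$. One advantage of explicitly knowing the
dependence of the $\nu^{in}$, $\nu^{deg}$ and $\nu_{deg,out}$ distributions as functions of the
out-degree distribution is that it is now possible to estimate the attractiveness using just the
empirical marginal distributions. Let us suppose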
that we are given a growing network with a large number 
of nodes, whose distribution we assume close to the limit 
measure. We take a snapshot and from the information 
gathered in the picture we estimate: 1) the out-degree distribution ($p_k$) by its empirical law,
$\hat{p}_k = N^{out}_k / N$, where $N^{out}_k$ is the number of nodes with $k$ out-links at the
time the picture was taken and $N$ is the total number of nodes, 2) similarly, the in-degree
distribution ($\nu^{in}_k$) by $\hat{\nu}^{in}_k = N^{in}_k / N$, and 3) $E_o$ by
$\hat{E}_o = \sum_k k \hat{p}_k$. On the other hand, for each model, $\nu^{in}_k$ can also be
estimated from the computed limit in-degree distribution (eq. (1)(b) for the model under study).
In this case, we define $\nu^{in}_{k,A}$ as $\nu^{in}_k$ of eq. (1)(b) after replacing $p_j$ by
$\hat{p}_j$ and $\delta$ by $A/\hat{E}_o$. Evidently, $\nu^{in}_{k,A}$ depends on $A$, and the
proposed consistent attractiveness estimator $\hat{A}$ becomes:

$$
\hat{A} = \underset{A \in \mathbb{R}_{\geq 0}}{\mathrm{argmin}}
\Big\{ \max_{k \in \mathbb{N}_0} \big| F_{\nu^{in}_{A}}(k) - F_{\hat{\nu}^{in}}(k) \big| \Big\},
\tag{2}
$$

where $F_{\mu^{in}}(k)$ is the cumulative distribution, $F_{\mu^{in}}(k) = \sum_{j=0}^{k} \mu^{in}_j$.
$\hat{A}$ is a minimum-distance (for the distribution) estimator, but there exist many other
methods for estimating $A$, such as maximum likelihood or the method of moments. The advantage of
the first method is discussed below. Alternatively, the estimation of $A$ can be carried out
using the degree and out-degree information
($\hat{A} = \mathrm{argmin}_{A \in \mathbb{R}_{\geq 0}} \{\max_{k \in \mathbb{N}} |F_{\nu^{deg}_{A}}(k) - F_{\hat{\nu}^{deg}}(k)|\}$),
or the joint information. Clearly, an estimator based on the joint distribution is better, since
there exist different joint distributions with the same marginal ones. Now, we test the model to
see whether it mimics the scientific citation network.
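The plug-in estimation described above can be sketched in a few lines. The code assumes that eq. (1)(b) reads $\nu^{in}_k = \sum_j p_j\,\psi(j+k+A,3+\delta)/\psi(j+A,2+\delta)$ with $\delta = A/E_o$, truncates the in-degree range at a finite `k_max`, and replaces the continuous minimization of eq. (2) by a search over a user-supplied grid. The function names and the dictionary input format are hypothetical choices made for this illustration.

```python
import math


def log_psi(a, b):
    """log psi(a, b), with psi(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def in_degree_law(p_out, A, k_max):
    """Model in-degree probabilities for k = 0..k_max (assumed eq. (1)(b)).

    p_out: dict {j: p_j}, the (empirical) out-degree law; needs j + A > 0.
    """
    mean_out = sum(j * pj for j, pj in p_out.items())
    delta = A / mean_out
    law = []
    for k in range(k_max + 1):
        total = 0.0
        for j, pj in p_out.items():
            # Work with logarithms of Gamma functions to avoid overflow.
            total += pj * math.exp(
                log_psi(j + k + A, 3 + delta) - log_psi(j + A, 2 + delta)
            )
        law.append(total)
    return law


def sup_distance(law_a, law_b):
    """max_k |F_a(k) - F_b(k)| over the common (truncated) range of k."""
    cum_a = cum_b = worst = 0.0
    for a, b in zip(law_a, law_b):
        cum_a += a
        cum_b += b
        worst = max(worst, abs(cum_a - cum_b))
    return worst


def estimate_attractiveness(p_out, nu_in_emp, grid):
    """Grid-search version of the minimum-distance estimator of eq. (2)."""
    k_max = len(nu_in_emp) - 1
    return min(
        grid,
        key=lambda A: sup_distance(in_degree_law(p_out, A, k_max), nu_in_emp),
    )
```

A quick consistency check of the assumed formula: with every node bringing exactly one link (`p_out = {1: 1.0}`) and `A = 0`, the expression reduces to $4/[(k+1)(k+2)(k+3)]$, the familiar degree law of linear preferential attachment written in terms of the in-degree. A grid search is used only for simplicity; any one-dimensional minimizer could replace it.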
The model (Fig. 1) seems to have all the real ingredients of the real dynamics of the growing
network. Each node represents a scientific publication and the directed links the citations. The
main point that can be criticized is the attachment probability law: in the model it depends on
the degree of each node (the number of papers that cite a randomly selected paper plus the number
of citations made by this article). We discuss another attachment law below. Fig. 2 shows the
citation distribution of all scientific publications in 1981, from the ISI dataset, which were
cited between 1981 and 1997 (see @). Clearly, this represents the in-degree distribution for a
growing network process. Unfortunately the out-degree distribution ($p_k$), the number of cites
in a randomly selected paper, has not been reported, making a plug-in approach (see eq. (1)) to
test the growing model impossible. Nevertheless, we adopt the following strategy: we assume a
geometric out-degree distribution $p_k = p(1-p)^k$ with $k \in \mathbb{N}_0$, a preferential
linking with attractiveness attachment law, and estimate $A$ and $p$ by eq. (2). Clearly, the
empirical out-degree distribution may not fall in any parametric family; however, a good
estimated in-degree distribution will be a very useful result, since the in-degree distribution
is a theoretical calculation based on the out-degree one. The statistic $T$ (the maximum distance
between cumulative distributions minimized in eq. (2)) achieves its minimum when A = and
p = 0.14 (see dashed line). If we are less ambitious, and disregard the first values of the
distribution, but require a good match from the 10th citation onwards, we obtain that p = 0.1
and A = work remarkably well. Note that the theoretical curve (solid line) is extremely similar
to the empirical one in almost the whole range of the probability. Moreover, in this case the
mean out-degree ($E_o$) is 9, which seems to be a good guess for the average number of cites for
all the scientific publications.

FIG. 2: Citation distribution for all papers published in 1981 (from the ISI) cited between 1981
and 1997. The theoretical citation (in-degree) curves are calculated by eq. (1)(b) assuming that
the out-degree distribution is geometric, $p_k = p(1-p)^k$ for $k \in \mathbb{N}_0$. The solid
line corresponds to eq. (1)(b) replacing A by and p by 0.1, the dashed one corresponds to
A = and p = 0.08.
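The mean out-degree quoted for $p = 0.1$ follows directly from the assumed geometric law on $\mathbb{N}_0$; as a short check:

```latex
E_o \;=\; \sum_{k=0}^{\infty} k\,p\,(1-p)^k \;=\; \frac{1-p}{p},
\qquad
p = 0.1 \;\Longrightarrow\; E_o = \frac{0.9}{0.1} = 9 .
```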

As we discussed before, the attachment probability law can be criticized. Perhaps in a better
model this law should depend on the in-degree (not on the degree) of each node. For this case,
using property 1 introduced in [1], it is very easy to compute the limit distributions
(eq. (3)), where $k, j \in \mathbb{N}_0$. This case is especially easy to solve because, for a
randomly selected node, the number of out-links ($D_{out}$) and the number of in-links
($D_{in}$) are independent random variables, so the joint distribution is the product of the
marginal ones. Note that for this attachment law the attractiveness must be strictly positive,
since if $A = 0$ we get that the limit in-degree probability is $\nu^{in}_k = \delta_{k=0}$. This
result is easy to understand: new papers appear but they cannot cite ($A = 0$) papers without
previous citations, so the scientific network will be formed almost entirely by papers with zero
citations and only a few highly cited ones. Clearly, in the limit this goes to a delta function.
Moreover, for $k \gg 1$, $\nu^{in}_k$ behaves as $k^{-(2+\delta)}$. Now, in the same way as
before, we find $A$ and $\delta$ such that $T$ is minimum. This happens for $A = 2.71$ and
$\delta = 0.55$, where $T = 0.108$ (see Fig. 3). If we are less ambitious, and disregard the
first values of the distribution, but now require a good match from the 30th citation onwards,
we obtain that $\delta = 1$ and $A = 11$ match the tail well.

FIG. 3: Citation distribution for all papers published in 1981 (from the ISI) cited between 1981
and 1997. The theoretical citation (in-degree) curves are calculated by eq. (3)(b). The solid
lines correspond to eq. (3)(b) replacing $A$ and $\delta$ by the different values shown in the
legend.
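As a sketch of how the in-degree law for this attachment rule can be obtained, assume that each new node brings $E_o$ links on average and that each link chooses an existing node with in-degree $k$ with probability proportional to $k + A$. The balance equation and its solution below are a reconstruction under these assumptions, written with the $\psi$ function of eq. (1). They reproduce the $k^{-(2+\delta)}$ tail and the concentration at $k = 0$ when $A \to 0$, but they should be checked against eq. (3) of the published letter.

```latex
% Stationary balance for the fraction \nu_k of nodes with in-degree k.
% Here \delta = A / E_o, so that E_o / (E_o + A) = 1 / (1 + \delta),
% and \mathbb{1}_{\{k=0\}} accounts for new nodes, born with no citations.
\nu_k \;=\; \frac{(k-1+A)\,\nu_{k-1} - (k+A)\,\nu_k}{1+\delta} \;+\; \mathbb{1}_{\{k=0\}}

% Solving the recursion (with \nu_{-1} = 0):
\nu_0 = \frac{1+\delta}{1+\delta+A},
\qquad
\frac{\nu_k}{\nu_{k-1}} = \frac{k-1+A}{k+A+1+\delta}
\;\;\Longrightarrow\;\;
\nu^{in}_k = \frac{\psi(k+A,\,2+\delta)}{\psi(A,\,1+\delta)},
\qquad A > 0 .

% Tail behaviour for k >> 1:
\nu^{in}_k \;\propto\; \frac{\Gamma(k+A)}{\Gamma(k+A+2+\delta)} \;\sim\; k^{-(2+\delta)} .
```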

Although we know that for both attachment probability laws presented the exponent of the
power-law tail of the in-degree distribution grows linearly with $A$ ($k^{-(3+A/E_o)}$ or
$k^{-(2+A/E_o)}$), we do not have much intuition about the role of the attractiveness in the
in-degree distribution. To try to understand this, in Fig. 3 we show the in-degree distribution
for different values of the attractiveness. The role of the attractiveness is to flatten the
first values of the in-degree distribution. The same type of behavior is observed for the case
where the attachment law depends on the degree of the node (data not shown). Once we have the
figure, the interpretation is immediate: for greater $A$, the probability of selecting a node
with, let us say, 2 citations will be very similar to that of selecting one with 1 or 3 citations
($\pi^{in}_k = (k+A)\nu^{in}_k/(E_o+A)$).
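The flattening effect can be made concrete by comparing the per-node attachment weights $k + A$ of two nodes. The helper below is a minimal illustration with a hypothetical name; it uses only the proportionality to $k + A$ stated in the text.

```python
def attachment_ratio(k1, k2, A):
    """Ratio of the per-node attachment weights (k1 + A) / (k2 + A)."""
    return (k1 + A) / (k2 + A)


# Without attractiveness a paper with 3 citations is three times more
# likely to be cited than a paper with 1 citation; with A = 11 the two
# papers are almost equally likely to be chosen.
ratio_without = attachment_ratio(3, 1, 0.0)   # 3.0
ratio_with = attachment_ratio(3, 1, 11.0)     # 14 / 12
```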

In general, a growing network model can be separated in two parts: A) the model itself (e.g. at
each time step a new node with $D_{out}$ links is added), and B) the attachment probability law
for the new links. We have previously shown two models that differ only in part B). It is
important to check whether a proposed model (A+B) describes the empirical data well. If we are
convinced that part A) is correct for describing an empirical growing network, we can test
whether the attachment probability law is, for example, preferential linking (or uniform). If
instead we are convinced that the attachment law is known (part B) correct), we can test whether
part A) of the model is adequate. Otherwise, it is also possible to test both hypotheses (the
whole model), but if $H_0$ is rejected we cannot determine which part (A, B or both) is
incorrect.

One advantage of the proposed estimator in eq. (2) is that it is now very easy to test any of
the hypotheses previously mentioned. For example, $H_0$: the real growing network has an
underlying link attachment law that is preferential with attractiveness. This hypothesis can be
tested Oil with the usual statistic,
$T = \max_{k \in \mathbb{N}_0} |F_{\nu^{in}_{\hat{A}}}(k) - F_{\hat{\nu}^{in}}(k)|$. This test is
one of the main results presented here.

Another important issue is to be able to rank the models. For example, in [11] a nice growing
model was proposed for the WWW dynamics. In this model, with probability $p$ a new node with
only one out-link is added, or (with probability $1 - p$) a new directed link from an existing
node is created. This is part A) of the model, and the following constitutes part B). The new
link from the new node is attached by preferential linking with attractiveness for the
in-degree, $\pi^{in}_k = \frac{k+A}{E_o+A}\nu^{in}_k$ (in [11] they use $\lambda$ instead of $A$
and work with rates and not with probabilities). The selection of the newly created link
involves two independent events: 1) the selection of the origin, by preferential linking with
attractiveness for the out-degree, and 2) the selection of the target, by preferential linking
with attractiveness (with a different parameter from the previous one) for the in-degree. An
alternative model can be the one proposed for the scientific network (see Fig. 1), where now the
nodes represent the web pages and the links the hyper-links. Clearly, both models have some weak
points. In the first one, people cannot put more than one hyper-link when they are constructing
their own web page (later some new links can appear). The second model (Fig. 1) does not have
this drawback, but it is "static": once the hyper-links are fixed they cannot be changed.
Probably, a mix of both models would be more realistic. How to compare or rank these models is a
relevant question in order to approach the "real model". There exist many statistical measures
that do this job; in particular, there is an extensive bibliography for hidden Markov chain
problems, and also for linear regression models. We propose a very simple ranking variable,
namely the value of $T$ computed using the (cumulative) joint distribution. The best model is
the one that has the minimum value of $T$. Much work remains to be done to understand what the
penalization (if one is necessary) should be for models with many parameters.
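The ranking rule can be sketched as follows, assuming each candidate model supplies its limit joint (in-degree, out-degree) probabilities on the same finite grid as the empirical ones. The nested-list layout, the truncation to a finite grid, and the function names are assumptions of this example; no penalization for the number of parameters is included.

```python
def joint_sup_distance(model, empirical):
    """max over (k, j) of |F_model(k, j) - F_emp(k, j)| for joint laws.

    model, empirical: equally sized nested lists; entry [k][j] is the
    probability of in-degree k and out-degree j.
    """
    cols = len(empirical[0])
    worst = 0.0
    prev = [0.0] * cols  # cumulative difference up to the previous row
    for model_row, emp_row in zip(model, empirical):
        row_acc = 0.0
        cur = []
        for j in range(cols):
            row_acc += model_row[j] - emp_row[j]
            # F_model(k, j) - F_emp(k, j) via a running two-dimensional sum.
            cur.append(prev[j] + row_acc)
            worst = max(worst, abs(cur[j]))
        prev = cur
    return worst


def rank_models(candidates, empirical):
    """Return (name, T) pairs sorted by increasing T; the first is best.

    candidates: iterable of (name, joint law) pairs.
    """
    scored = [(name, joint_sup_distance(law, empirical))
              for name, law in candidates]
    return sorted(scored, key=lambda item: item[1])
```

For instance, `rank_models([("model_a", law_a), ("model_b", law_b)], empirical)` returns the candidates ordered from the smallest to the largest value of $T$.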

In summary, we discussed: 1) how to estimate model parameters, showing an application to the
scientific publications network, and 2) a way of checking whether a proposed model is correct,
based on the limit (joint) in- and out-degree distribution. The results presented here thus shed
some light on the problem of estimating the underlying attachment law, ranking models, and
testing models in a general way.